1 Getting ready

First we load the necessary packages.

library(here) # A Simpler Way to Find Your Files, CRAN v1.0.1  
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.0 
library(dada2) # Accurate, high-resolution sample inference from amplicon sequencing data, Bioconductor v1.18.0 
library(plyr) # Tools for Splitting, Applying and Combining Data, CRAN v1.8.6 
library(DT) # A Wrapper of the JavaScript Library 'DataTables', CRAN v0.17 
library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN v4.9.3 
library(biomformat) # An interface package for the BIOM file format, Bioconductor v1.18.0 

# Set seed
set.seed(1910)

Set the path to the fastq files:

path <- here::here("data/raw/casava-18-paired-end-demultiplexed-run2")
head(list.files(path))
## [1] "AqFl1-001_S1_L001_R1_001.fastq.gz"  "AqFl1-001_S1_L001_R2_001.fastq.gz" 
## [3] "AqFl1-004_S12_L001_R1_001.fastq.gz" "AqFl1-004_S12_L001_R2_001.fastq.gz"
## [5] "AqFl1-006_S23_L001_R1_001.fastq.gz" "AqFl1-006_S23_L001_R2_001.fastq.gz"

Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.

# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq.gz and SAMPLENAME_R2_001.fastq.gz
fnFs <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "AqFl1-001" "AqFl1-004" "AqFl1-006" "AqFl1-008" "AqFl1-010" "AqFl1-012"

2 Inspect read quality

We start by visualizing the quality profiles of the forward reads:

plotQualityProfile(fnFs[1:2])

In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quantiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).

The forward reads are good quality. We generally trim the last few nucleotides to avoid less well-controlled errors that can arise there. These quality profiles do not suggest that any additional trimming is needed. We will truncate the forward reads at position 290 (trimming the last 10 nucleotides).

Now we visualize the quality profile of the reverse reads:

plotQualityProfile(fnRs[1:2])

The reverse reads are of significantly worse quality, which is common in Illumina sequencing. This isn’t too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm robust to lower quality sequence, but trimming as the average qualities crash will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 248 where the quality distribution crashes.

Considerations: the reads must still overlap after truncation in order to merge them later! For less-overlapping primer sets, like the V1-V2 we used for the present study, the truncLen must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between them.

3 Filter and Trim

Assign filenames for the filtered fastq files.

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

We’ll use standard filtering parameters: maxN = 0 (DADA2 requires no Ns), truncQ = 2, rm.phix = TRUE and maxEE = 2. The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores. We’ll also trim off the primer sequence from the forward and reverse reads by setting trimLeft = c(20, 18). Trimming and filtering is performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(20, 18),
                     truncLen = c(290,248), maxN = 0, maxEE = c(2,2), truncQ = 2, 
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE) 
head(out)
##                                    reads.in reads.out
## AqFl1-001_S1_L001_R1_001.fastq.gz     61001     50347
## AqFl1-004_S12_L001_R1_001.fastq.gz    78819     66715
## AqFl1-006_S23_L001_R1_001.fastq.gz    66665     48526
## AqFl1-008_S34_L001_R1_001.fastq.gz    34873     27090
## AqFl1-010_S45_L001_R1_001.fastq.gz    64724     43901
## AqFl1-012_S56_L001_R1_001.fastq.gz    74970     61428

Considerations: The standard filtering parameters are starting points, not set in stone. If speeding up downstream computation is needed, consider tightening maxEE. If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (eg. maxEE = c(2,5)), and reducing the truncLen to remove low quality tails. Remember though, when choosing truncLen for paired-end reads we must maintain overlap after truncation in order to merge them later.

4 Learn the error rates

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

errF <- learnErrors(filtFs, multithread = TRUE)
## 106911360 total bases in 395968 reads from 8 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread = TRUE)
## 109339700 total bases in 475390 reads from 10 samples will be used for learning the error rates.

It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:

plotErrors(errF, nominalQ = TRUE)

plotErrors(errR, nominalQ = TRUE)

The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.

5 Sample inference

We are now ready to apply the core sample inference algorithm to the data. By default, the dada function processes each sample independently. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. Here, we set pool = TRUE to retain the singletons for the downstream species richness estimation.

dadaFs <- dada(filtFs, err = errF, pool = TRUE, multithread = TRUE) 
## 74 samples were pooled: 3111994 reads in 315261 unique sequences.
dadaRs <- dada(filtRs, err = errR, pool = TRUE, multithread = TRUE)
## 74 samples were pooled: 3111994 reads in 466276 unique sequences.

Inspect the dada-class object

dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 882 sequence variants were inferred from 7247 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

6 Merge paired reads

We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)

# Inspect the merger data.frame from the first sample
head(mergers[[1]])
##                                                                                                                                                                                                                                                                                                                                           sequence
## 1         GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGCTGCTTTTTCCACCGAAGCTTGCTTCACCGGAAAAAGCGGAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCTTCAGCGGGGGATAACACTTGGAAACAGGTGCTAATACCGCATAGGATTTCTGTTCGCATGAACGGAGAAGGAAAGACGGCGTAAGCTGTCACTGAAGGATGGACCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAATGATGCATAGCCGACCTGAGAGGGTAATCGGCCACACTGGGACTGAGACACGGCCCAG
## 2                               GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCCTCAGCTTGCTGGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCTTGCACTCTGGGATAAGCTTGGGAAACTGGGTCTAATACCGGATATGAACTGCCTTTAGTGTGGTGGTTGGAAAGTTTTTTCGGTGCAAGATGAGCTCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGTACGGCCACATTGGGACTGAGATACGGCCCAG
## 3 GATGAACGCTGGCGGCATGCCTAATACATGCAAGTCGAACGAAGAGGCCCTTAGACTTGGTGCTTGCACTAAAGACAAGGATAACCTCTTAGTGGCGAACGGGTGAGTAACACGTAGGTAACCTACAACTAAGACGAGGACAACAGTTGGAAACGACTGCTAATACTGGATAGGAATTAAGATTGCATATTCTTGATTTTAAAGTAGCTTCTGGCTACACTTAGAGATGGACCTGCGGCGCATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACATTGGGACTGAGACACGGCCCAA
## 4     GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGGAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
## 5                             GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCCTGCAGCTTGCTGCGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTCAACTTTGGGATAAGCCTGGGAAACTGGGTCTAATACCGGATATGACCTTTCTTTAGTGTGGAAGGTGGAAAGCTTTTGCGGTTGGGGATGAGCTCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGTACGGCCACATTGGGACTGAGATACGGCCCAG
## 6     GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGAAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAG
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1      3335       1       1    172         0      0      1   TRUE
## 2      2463       5       6    194         0      0      1   TRUE
## 3      1863       4      11    164         0      0      1   TRUE
## 4      1836       3       4    168         0      0      1   TRUE
## 5      1779       2       3    192         0      0      1   TRUE
## 6      1603       6       2    168         0      0      2   TRUE

Considerations: Most of reads should successfully merge. If that is not the case, upstream parameters may need to be revisited: Did we trim away the overlap between the reads?

7 Construct sequence table

We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.

seqtab <- makeSequenceTable(mergers)

The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.

dim(seqtab)
## [1]   74 7124

Inspect distribution of sequence lengths

table(nchar(getSequences(seqtab))) %>% 
  as.data.frame() %>% 
  rename("seqence length" = Var1) %>% 
  datatable(options = list(
    columnDefs = list(list(className = 'dt-left', targets = c(0:2)))
    ))

Plot sequence length distribution

seqLen <- nchar(getSequences(seqtab)) %>% 
  as.data.frame() %>% 
  rename(seqLen = ".") 

ggplot(seqLen, aes(x = seqLen)) + 
  geom_histogram(binwidth = 1, alpha = 0.2, position = "identity", colour = "red") +
  geom_vline(aes(xintercept = mean(seqLen)), color = "blue", linetype = "dashed", size = 0.5) +
  scale_x_continuous(breaks = seq(round_any(min(seqLen), 10), 
                                round_any(max(seqLen)*1.05, 10, f = ceiling), 
                                round_any((max(seqLen)-min(seqLen))/10, 10, f = ceiling))) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)), 
                     breaks = seq(0, 
                                round_any(max(table(seqLen))*1.05, 10, f = ceiling), 
                                round_any(max(table(seqLen))/10, 10, f = ceiling))) +
  labs(x = "sequence length (bp)", title = "Amplicon length distribution") +
  annotate("text", label = "mean length", x = mean(seqLen$seqLen)-2, 
           y = round_any(max(table(seqLen)), 10, f = ceiling), hjust = 1, colour = "blue") +
  theme_bw() 

Considerations: Sequences that are much longer or shorter than expected may be the result of non-specific priming. The sequence lengths fall within the range of the expected amplicon sizes, most of which fall between 270 and 345 with a long tail on the right. We’ll just leave them as they are.

8 Remove chimeras

The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences. As we used pool = TRUE during the sample inference, we should use method = "pooled" for the chimera removal.

seqtab.nochim <- removeBimeraDenovo(seqtab, method = "pooled", multithread = TRUE, verbose = TRUE)
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9519334
dim(seqtab.nochim)
## [1]   74 3594

The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on on factors including experimental procedures and sample complexity.

Considerations: Most of the reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of the reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.

9 Track reads through the pipeline

As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:

getN <- function(x) sum(getUniques(x))
stats <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(stats) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(stats) <- sample.names
datatable(stats)

Plot the sequences stats:

p_stats <- stats %>% 
  as.data.frame() %>%
  rownames_to_column("SampleID") %>%
  mutate_at(vars("filtered":"nonchim"), ~100*.x/input) %>% 
  mutate(input = 100) %>%
  gather(key = "step", value = "percent", -SampleID) %>%
  mutate(step = factor(step, levels = c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim"))) %>%
  ggplot(aes(x = step, y = percent, color = SampleID)) +
    geom_point() +
    geom_line(aes(group = SampleID)) +
    scale_y_continuous(breaks = 0:10*10) +
    labs(x = "", y = "Reads retained (%)") +
    theme_bw()

ggplotly(p_stats, tooltip = c("x", "y", "colour"))

Except for the 7 samples on the bottom of the plot, 6 of which are negative controls, we kept the majority of our raw reads. Sample AqFl1-062 lost a large fraction of raw reads during filtering, suggesting worse read quality than other samples.

Considerations: This is a great place to do a last sanity check. Outside of filtering, there should be no step in which a majority of reads are lost. If a majority of reads failed to merge, one may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span the amplicon. If a majority of reads were removed as chimeric, one may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.

10 Export data

Export feature table:

t(seqtab.nochim) %>%
  make_biom() %>%
  write_biom(here::here("data/intermediate/dada2/table-run2.biom"))

Export representative sequences:

uniquesToFasta(seqtab.nochim, fout = here::here("data/intermediate/dada2/rep-seqs-run2.fna"), 
               ids = colnames(seqtab.nochim))

11 Acknowledgements

The processing of raw sequence data into an ASV table is based on the online DADA2 tutorial (1.12). For more documentations and tutorials, visit the DADA2 website.

12 Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=nb_NO.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=nb_NO.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=nb_NO.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] biomformat_1.18.0 plotly_4.9.3      DT_0.17           plyr_1.8.6       
##  [5] dada2_1.18.0      Rcpp_1.0.6        forcats_0.5.1     stringr_1.4.0    
##  [9] dplyr_1.0.5       purrr_0.3.4       readr_1.4.0       tidyr_1.1.3      
## [13] tibble_3.1.0      ggplot2_3.3.3     tidyverse_1.3.0   here_1.0.1       
## [17] knitr_1.31       
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-0            hwriter_1.3.2              
##   [3] ellipsis_0.3.1              rprojroot_2.0.2            
##   [5] htmlTable_2.1.0             XVector_0.30.0             
##   [7] GenomicRanges_1.42.0        base64enc_0.1-3            
##   [9] fs_1.5.0                    rstudioapi_0.13            
##  [11] farver_2.1.0                fansi_0.4.2                
##  [13] lubridate_1.7.10            xml2_1.3.2                 
##  [15] codetools_0.2-18            splines_4.0.3              
##  [17] Formula_1.2-4               jsonlite_1.7.2             
##  [19] Rsamtools_2.6.0             broom_0.7.5                
##  [21] cluster_2.1.0               dbplyr_2.1.0               
##  [23] png_0.1-7                   compiler_4.0.3             
##  [25] httr_1.4.2                  backports_1.2.1            
##  [27] lazyeval_0.2.2              assertthat_0.2.1           
##  [29] Matrix_1.3-2                cli_2.3.1                  
##  [31] htmltools_0.5.1.1           tools_4.0.3                
##  [33] gtable_0.3.0                glue_1.4.2                 
##  [35] GenomeInfoDbData_1.2.4      reshape2_1.4.4             
##  [37] ShortRead_1.48.0            Biobase_2.50.0             
##  [39] cellranger_1.1.0            jquerylib_0.1.3            
##  [41] rhdf5filters_1.2.0          vctrs_0.3.6                
##  [43] Biostrings_2.58.0           debugme_1.1.0              
##  [45] crosstalk_1.1.1             xfun_0.22                  
##  [47] ps_1.6.0                    rvest_1.0.0                
##  [49] lifecycle_1.0.0             zlibbioc_1.36.0            
##  [51] scales_1.1.1                hms_1.0.0                  
##  [53] MatrixGenerics_1.2.0        parallel_4.0.3             
##  [55] SummarizedExperiment_1.20.0 rhdf5_2.34.0               
##  [57] RColorBrewer_1.1-2          yaml_2.2.1                 
##  [59] gridExtra_2.3               sass_0.3.1                 
##  [61] rpart_4.1-15                latticeExtra_0.6-29        
##  [63] stringi_1.5.3               highr_0.8                  
##  [65] S4Vectors_0.28.0            checkmate_2.0.0            
##  [67] BiocGenerics_0.36.0         BiocParallel_1.24.1        
##  [69] GenomeInfoDb_1.26.0         rlang_0.4.10               
##  [71] pkgconfig_2.0.3             bitops_1.0-6               
##  [73] matrixStats_0.58.0          evaluate_0.14              
##  [75] lattice_0.20-41             Rhdf5lib_1.12.0            
##  [77] labeling_0.4.2              GenomicAlignments_1.26.0   
##  [79] htmlwidgets_1.5.3           tidyselect_1.1.0           
##  [81] magrittr_2.0.1              R6_2.5.0                   
##  [83] IRanges_2.24.0              generics_0.1.0             
##  [85] Hmisc_4.5-0                 DelayedArray_0.16.0        
##  [87] DBI_1.1.1                   pillar_1.5.1               
##  [89] haven_2.3.1                 foreign_0.8-81             
##  [91] withr_2.4.1                 survival_3.2-7             
##  [93] RCurl_1.98-1.3              nnet_7.3-15                
##  [95] modelr_0.1.8                crayon_1.4.1               
##  [97] utf8_1.2.1                  rmarkdown_2.7              
##  [99] jpeg_0.1-8.1                grid_4.0.3                 
## [101] readxl_1.3.1                data.table_1.14.0          
## [103] reprex_1.0.0                digest_0.6.27              
## [105] RcppParallel_5.0.3          stats4_4.0.3               
## [107] munsell_0.5.0               viridisLite_0.3.0          
## [109] bslib_0.2.4